-
Notifications
You must be signed in to change notification settings - Fork 15.3k
[AMDGPU] Enable i8 GEP promotion for vector allocas #166132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@llvm/pr-subscribers-backend-amdgpu Author: Harrison Hao (harrisonGPU) ChangesThis patch adds support for the pattern: %elt = getelementptr inbounds i8, ptr addrspace(5) %alloca, i32 %indexby scaling the byte offset to an element index (index >> log2(ElemSize)), Full diff: https://github.com/llvm/llvm-project/pull/166132.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index ddabd25894414..793c0237cdf38 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -456,10 +456,21 @@ static Value *GEPToVectorIndex(GetElementPtrInst *GEP, AllocaInst *Alloca,
const auto &VarOffset = VarOffsets.front();
APInt OffsetQuot;
APInt::sdivrem(VarOffset.second, VecElemSize, OffsetQuot, Rem);
- if (Rem != 0 || OffsetQuot.isZero())
- return nullptr;
+
+ Value *Scaled = nullptr;
+ if (Rem != 0 || OffsetQuot.isZero()) {
+ unsigned ElemSizeShift = Log2_64(VecElemSize);
+ Scaled = Builder.CreateLShr(VarOffset.first, ElemSizeShift);
+ if (Instruction *NewInst = dyn_cast<Instruction>(Scaled))
+ NewInsts.push_back(NewInst);
+ OffsetQuot = APInt(BW, 1);
+ Rem = 0;
+ }
Value *Offset = VarOffset.first;
+ if (Scaled)
+ Offset = Scaled;
+
auto *OffsetType = dyn_cast<IntegerType>(Offset->getType());
if (!OffsetType)
return nullptr;
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
index 76e1868b3c4b9..65bddaba8dd14 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-vector-gep.ll
@@ -250,6 +250,26 @@ bb2:
store i32 0, ptr addrspace(5) %extractelement
ret void
}
+
+define amdgpu_kernel void @scalar_alloca_vector_gep_i8(ptr %buffer, float %data, i32 %index) {
+; CHECK-LABEL: define amdgpu_kernel void @scalar_alloca_vector_gep_i8(
+; CHECK-SAME: ptr [[BUFFER:%.*]], float [[DATA:%.*]], i32 [[INDEX:%.*]]) {
+; CHECK-NEXT: [[ALLOCA:%.*]] = freeze <3 x float> poison
+; CHECK-NEXT: [[VEC:%.*]] = load <3 x float>, ptr [[BUFFER]], align 16
+; CHECK-NEXT: [[TMP1:%.*]] = lshr i32 [[INDEX]], 2
+; CHECK-NEXT: [[TMP2:%.*]] = insertelement <3 x float> [[VEC]], float [[DATA]], i32 [[TMP1]]
+; CHECK-NEXT: store <3 x float> [[TMP2]], ptr [[BUFFER]], align 16
+; CHECK-NEXT: ret void
+;
+ %alloca = alloca <3 x float>, align 16, addrspace(5)
+ %vec = load <3 x float>, ptr %buffer
+ store <3 x float> %vec, ptr addrspace(5) %alloca
+ %elt = getelementptr inbounds nuw i8, ptr addrspace(5) %alloca, i32 %index
+ store float %data, ptr addrspace(5) %elt, align 4
+ %updated = load <3 x float>, ptr addrspace(5) %alloca, align 16
+ store <3 x float> %updated, ptr %buffer, align 16
+ ret void
+}
;.
; CHECK: [[META0]] = !{}
; CHECK: [[RNG1]] = !{i32 0, i32 1025}
|
shiltian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the index here is somewhere in the middle? It doesn't look like we have any constraint on the index.
|
|
||
| Value *Scaled = nullptr; | ||
| if (Rem != 0 || OffsetQuot.isZero()) { | ||
| unsigned ElemSizeShift = Log2_64(VecElemSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to validate VecElemSize is a power of 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but I think it is not necessary to explicitly check whether the element size is a power of two, because it is already covered by the existing check here:
llvm-project/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
Lines 871 to 877 in 52fdcf9
| Type *VecEltTy = VectorTy->getElementType(); | |
| unsigned ElementSizeInBits = DL->getTypeSizeInBits(VecEltTy); | |
| if (ElementSizeInBits != DL->getTypeAllocSizeInBits(VecEltTy)) { | |
| LLVM_DEBUG(dbgs() << " Cannot convert to vector if the allocation size " | |
| "does not match the type's size\n"); | |
| return false; | |
| } |
If the element type is not naturally aligned, it will return false, which also rejects non power of 2 element sizes, such as i24.
| %updated = load <3 x float>, ptr addrspace(5) %alloca, align 16 | ||
| store <3 x float> %updated, ptr %buffer, align 16 | ||
| ret void | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test with non-power-of-2 element size
Could you please give me an example? |
In the test case, the |
c1439a3 to
6a28740
Compare
Hi @shiltian , I’ve already thought about this issue, thank you very much for your suggestion and for pointing it out. |
| return nullptr; | ||
|
|
||
| Value *Offset = VarOffset.first; | ||
| if (Rem != 0 || OffsetQuot.isZero()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it have to be this complicated? I thought checking whether offset % size would be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I agree your points, I have removed it.
🐧 Linux x64 Test Results
|
This patch adds support for the pattern:
by scaling the byte offset to an element index (index >> log2(ElemSize)),
allowing the vector element to be updated with insertelement instead of using
scratch memory.